Exercises




Simple reading

The file ../data/coordinates.txt contains a list of (x, y) value pairs, one pair per line. Read the values into two lists, x and y.



In [ ]:

    
xs = []
ys = []
with open("../data/coordinates.txt", "r") as f:
    for line in f:
        line = line.split()
        xs.append(float(line[0]))
        ys.append(float(line[1]))
print(xs)
print(ys)
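
The same parsing can be tried without a file on disk. The following is a minimal sketch that assumes whitespace-separated pairs, one per line; the sample values and the helper name read_pairs are made up for illustration and are not taken from coordinates.txt.

```python
import io


def read_pairs(lines):
    """Split an iterable of 'x y' text lines into two lists of floats."""
    xs, ys = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        xs.append(float(fields[0]))
        ys.append(float(fields[1]))
    return xs, ys


# Made-up sample data standing in for the real file.
sample = io.StringIO("1.0 2.0\n3.5 -4.25\n\n0 10\n")
xs, ys = read_pairs(sample)
print(xs)
print(ys)
```

Because read_pairs accepts any iterable of lines, an open file object can be passed to it in the same way as the StringIO object.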

Nontrivial reading and conversion

The file ../data/CH4.pdb contains the coordinates of a methane molecule in PDB format. The file consists of a header followed by record lines, each of which contains the following fields:

record name (= ATOM), atom serial number, atom name, x-, y-, z-coordinates, occupancy, and temperature factor.

For example:

ATOM      2 H                   -0.627  -0.627   0.627  0.00  0.00

Convert the file into XYZ format: the first line contains the number of atoms, the second line is a title string, and the following lines contain the atomic symbols and the x-, y-, z-coordinates, all separated by whitespace. Write the coordinates with 6 decimals:

5
Converted from PDB
C    0.000000   0.000000   0.000000
...



In [ ]:

    
infile = '../data/CH4.pdb'
outfile = infile.replace('.pdb', '.xyz')
atoms = []
with open(infile, "r") as f:
    for line in f:
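        # Match the record name at the start of the line, so that header
        # lines that merely mention "ATOM" are not parsed as atom records.
        if line.startswith('ATOM'):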
            line = line.split()
            symbol = line[2]
            coords = [float(x) for x in line[3:6]]
            atoms.append((symbol, coords))
            
with open(outfile, "w") as f:
    f.write("{0}\n".format(len(atoms)))
    f.write("Converted from PDB\n")
    for atom in atoms:
        f.write("{0:2s} {1:10.6f} {2:10.6f} {3:10.6f}\n".format(atom[0],
                      atom[1][0], atom[1][1], atom[1][2]))
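
The conversion can also be sketched as a function that works on lines in memory, which makes it easy to check the output format. This is only a sketch: the function name pdb_to_xyz and the sample records are made up for illustration, and it assumes, as the solution above does, that the fields are separated by whitespace.

```python
def pdb_to_xyz(pdb_lines, title="Converted from PDB"):
    """Return the lines of an XYZ file built from the ATOM records."""
    atoms = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # header or other non-atom record
        fields = line.split()
        symbol = fields[2]
        x, y, z = (float(value) for value in fields[3:6])
        atoms.append((symbol, x, y, z))

    # First the atom count, then the title, then one line per atom.
    xyz_lines = [str(len(atoms)), title]
    for symbol, x, y, z in atoms:
        xyz_lines.append(
            "{0:2s} {1:10.6f} {2:10.6f} {3:10.6f}".format(symbol, x, y, z))
    return xyz_lines


# Made-up sample records in the layout described above.
sample = [
    "HEADER    made-up example",
    "ATOM      1 C                    0.000   0.000   0.000  0.00  0.00",
    "ATOM      2 H                   -0.627  -0.627   0.627  0.00  0.00",
]
xyz = pdb_to_xyz(sample)
print("\n".join(xyz))
```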

Bonus exercises

Delimiter separated values

Many data exchange formats are so-called delimiter-separated values formats. The best known of these is CSV (comma-separated values).

The format has several caveats. For example, many European locales use the comma (,) as the decimal separator and the semicolon (;) as the field separator, whereas most English-language systems use the dot (.) as the decimal separator and the comma (,) as the field separator.

Another family of formats uses whitespace, such as space or tab characters, to separate fields.
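
As a small illustration of the caveats above, the sketch below reads semicolon-separated data with decimal commas. The sample data and column names are made up, and replacing the comma with a dot is only one simple way to convert such numbers; it assumes the values contain no thousands separators.

```python
import csv
import io

# Made-up sample in the "European" convention: ';' separates fields
# and ',' is the decimal separator.
sample = io.StringIO("length;width\n5,1;3,5\n4,9;3,0\n")

reader = csv.reader(sample, delimiter=";")
header = next(reader)  # the first row holds the column labels
rows = [[float(value.replace(",", ".")) for value in row] for row in reader]

print(header)
print(rows)
```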

Python's csv module handles most of the variation between these formats, and it can save a lot of time for those who work with such files in Python.

The file "../data/iris.data" is actually in CSV format even though the file extension doesn't explicitly say so (this is common).

Read in iris.data and write out a tab-separated file "iris.tsv" using the csv module.

Hint: because the first line of the input file has labels, csv.DictReader and csv.DictWriter are a good choice.



In [ ]:

    
import csv
irises = []
with open("../data/iris.data", newline="") as inputfile:
    chreader = csv.DictReader(inputfile)
    for line in chreader:
        irises.append(line)
        
print(irises[0])
# newline="" lets the csv module control the line endings itself
with open("../data/iris.tsv", "w", newline="") as outputfile:
    writer = csv.DictWriter(outputfile, delimiter="\t", fieldnames=["sepal.length","sepal.width","petal.length","petal.width","class"])
    writer.writeheader()
    for iris in irises:
        writer.writerow(iris)
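
The field names do not have to be typed in by hand: csv.DictReader exposes the labels it read from the first line as its fieldnames attribute. The sketch below shows this on in-memory data; the two sample rows are made up, and lineterminator="\n" is set only to keep the printed result easy to read.

```python
import csv
import io

# Made-up two-row sample with a header line.
source = io.StringIO("sepal.length,class\n5.1,setosa\n7.0,versicolor\n")
target = io.StringIO()

reader = csv.DictReader(source)
# Reuse the labels from the input header for the output file.
writer = csv.DictWriter(target, fieldnames=reader.fieldnames,
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
for row in reader:
    writer.writerow(row)

tsv_text = target.getvalue()
print(tsv_text)
```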

The file ../data/word_count.txt contains a short piece of text. Determine the frequency of words in the file, i.e. how many times each word appears. Print out the ten most frequent words.

Read the file line by line and use the string method split() to separate each line into words. The frequencies are most conveniently stored in a dictionary. The dictionary method setdefault can be useful here.

For sorting, convert the dictionary into a list of (key, value) pairs with the items() method:

words = {"foo" : 1, "bar" : 2}
print(list(words.items()))
[('foo', 1), ('bar', 2)]



In [ ]:

    
words = {}
with open("../data/word_count.txt", "r") as f:
    for line in f:
        line = line.split()
        for word in line:
            words.setdefault(word, 0)
            words[word] += 1

word_list = [(value, key) for key, value in words.items()]
word_list.sort(reverse=True)
for freq, word in word_list[:10]:
    word = '"%s"' % word
    print("The word {0:^15} appears {1:5} times".format(word, freq))
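
As an alternative to a plain dictionary, the standard library class collections.Counter does the counting and the ranking in one object. The sketch below uses a made-up two-line text instead of word_count.txt.

```python
import io
from collections import Counter

# Made-up sample text standing in for the real file.
sample = io.StringIO("the cat and the dog\nthe end\n")

counts = Counter()
for line in sample:
    counts.update(line.split())  # add one count per word on this line

# most_common(n) returns the n most frequent (word, count) pairs.
top = counts.most_common(2)
for word, freq in top:
    print("The word {0:^15} appears {1:5} times".format('"%s"' % word, freq))
```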

Reading nucleotide sequences

FASTA is a file format for storing nucleotide or protein sequences. Each sequence consists of a header line, starting with >, followed by one or more lines containing the sequence in single-letter codes (amino acids in the example below):

>5IRE:A|PDBID|CHAIN|SEQUENCE
IRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELVTTTVSNMAEVRSYCYEASISDMASDSRCPTQGEA
YLDKQSDTQYVCKRTLVDRGWGNGCGLFGKGSLVTCAKFACSKKMTGKSIQPENLEYRIMLSVHGSQHSGMIVNDTGHET
...

The file ../data/5ire.fasta contains the sequences of multiple chains of the Zika virus. Read the sequence of chain C from the file (the chain IDs are given in the header, e.g. the chain above is A).

Find out which chains contain the subsequence LDFSDL.

Hint: as a sequence is spread over multiple lines, you should combine all the lines of a sequence into a single string. The string method .strip(), which removes leading and trailing whitespace (including the newline at the end of a line), is useful here.



In [ ]:

    
chains = {}
with open("../data/5ire.fasta", "r") as f:
    for line in f:
        if line.startswith('>'):
            # We have a header
            key = line.split('|')[0].split(':')[1]            
            chains[key] = ""
        else:
            chains[key] += line.strip()

print('Chain C:')
print(chains['C'])
print()

subsequence = 'LDFSDL'
for key, sequence in chains.items():
    if subsequence in sequence:
        print("Chain {0} contains subsequence {1}".format(key, subsequence))
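
The parsing step can also be sketched as a reusable function that collects the lines of each record in a list and joins them once at the end. The function name parse_fasta and the sample records are made up for illustration; the sketch keys the result by the full header line rather than by chain ID.

```python
def parse_fasta(lines):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(parts)
            header = line[1:]
            parts = []
        elif header is not None:
            parts.append(line)
    # Store the last record, which has no following header to close it.
    if header is not None:
        records[header] = "".join(parts)
    return records


# Made-up sample with two short records.
sample = [
    ">X:A|PDBID|CHAIN|SEQUENCE\n",
    "IRCIGV\n",
    "SNRDFV\n",
    ">X:B|PDBID|CHAIN|SEQUENCE\n",
    "LDFSDL\n",
]
records = parse_fasta(sample)
hits = [name for name, seq in records.items() if "LDFSDL" in seq]
print(hits)
```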